MLP yes! ILP no!
Abstract
Problem Description: It should be well known that processors are outstripping memory performance; specifically, memory latencies are not improving as fast as processor cycle time, IPC, or memory bandwidth. Thought experiment: imagine that a cache miss takes 10000 cycles to service. For such a processor, instruction-level parallelism (ILP) is useless, because most of the time is spent waiting for memory. Branch prediction is also less effective, since most branches can be determined with data already in registers or in the cache; branch prediction only helps for branches that depend on outstanding cache misses. At the same time, pressures for reduced power consumption mount.

Given such trends, some computer architects in industry (although not Intel EPIC) are talking seriously about retreating from out-of-order superscalar processor architecture and instead building simpler, faster, dumber, 1-wide in-order processors with high degrees of speculation. Sometimes this is proposed in combination with multiprocessing and multithreading: tolerate long memory latencies by switching to other processes or threads.

I propose something different: build narrow, fast machines, but use intelligent logic inside the CPU to increase the number of outstanding cache misses that can be generated from a single program. By MLP (memory-level parallelism) I mean simply the number of outstanding cache misses that can be generated (by a single thread, task, or program) and executed in an overlapped manner. It does not matter what sort of execution engine generates the multiple outstanding cache misses. An out-of-order superscalar ILP CPU may generate multiple outstanding cache misses, but 1-wide processors can be just as effective.

Change the metrics: total execution time remains the overall goal, but instead of reporting IPC as an approximation to it, we must report MLP. Limit studies should be in terms of the total number of non-overlapped cache misses on the critical path.
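The thought experiment can be put into rough numbers with a toy model. The sketch below is an illustration only, not a model taken from the paper: it assumes execution time is simply compute cycles (instructions / IPC) plus memory stall cycles (misses × latency / MLP), with no overlap between the two. The function name `execution_cycles` and every workload parameter are hypothetical.

```python
def execution_cycles(instructions, ipc, misses, miss_latency, mlp):
    """Toy estimate: compute cycles plus miss stalls, with `mlp` misses overlapped."""
    compute = instructions / ipc
    stall = misses * miss_latency / mlp
    return compute + stall


# Hypothetical workload: 1M instructions, one miss per 1000 instructions,
# and the 10000-cycle miss latency from the thought experiment.
base = execution_cycles(1_000_000, ipc=1, misses=1_000, miss_latency=10_000, mlp=1)
wide = execution_cycles(1_000_000, ipc=4, misses=1_000, miss_latency=10_000, mlp=1)
overlapped = execution_cycles(1_000_000, ipc=1, misses=1_000, miss_latency=10_000, mlp=4)

print(f"speedup from 4x IPC: {base / wide:.2f}")        # 1.07
print(f"speedup from 4x MLP: {base / overlapped:.2f}")  # 3.14
```

Under these assumed numbers, quadrupling IPC buys about 7%, while overlapping four misses at a time gives roughly a 3x speedup, which is the sense in which IPC stops being a useful proxy for execution time.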
Now do the research: Many present-day hot topics in computer architecture help ILP but do not help MLP. As mentioned above, predicting branch directions for branches that can be determined from data already in the cache or in registers does not help MLP for extremely long latencies. Similarly, prefetching of data cache misses for array processing codes does not help MLP; it just moves it around. Instead, investigate microarchitectures that help MLP: (0) Trivial case: explicit multithreading, like SMT. (1) Slightly less trivial case: implicitly multithread single programs, either by compiler software on an MT machine, or by a hybrid, such as …
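One way to compare such microarchitectures in a limit study is the metric proposed in the problem description: the number of non-overlapped cache misses on the critical path. The following is a minimal sketch under stated assumptions, not the author's methodology: it assumes a trace in which each miss lists the earlier misses whose data it needs, and it assumes unlimited hardware resources so that independent misses overlap fully. The input format and the name `critical_path_misses` are hypothetical.

```python
def critical_path_misses(deps):
    """Length, in misses, of the longest chain of dependent cache misses.

    deps[i] lists the earlier misses (indices < i) whose data miss i needs
    before its own address is known. Hypothetical trace format.
    """
    depth = []
    for i, producers in enumerate(deps):
        if any(not 0 <= p < i for p in producers):
            raise ValueError("misses must be listed in dependency order")
        # A miss can issue only after its deepest producer has returned.
        depth.append(1 + max((depth[p] for p in producers), default=0))
    return max(depth, default=0)


# Two hypothetical 8-miss traces: a linked-list walk where each miss
# depends on the previous one, and eight mutually independent loads.
pointer_chase = [[]] + [[i] for i in range(7)]
independent = [[] for _ in range(8)]

for name, trace in [("pointer chase", pointer_chase), ("independent", independent)]:
    serial = critical_path_misses(trace)
    print(f"{name}: {len(trace)} misses, {serial} non-overlapped, "
          f"ideal MLP {len(trace) / serial:.1f}")
```

In this sketch the pointer chase has 8 non-overlapped misses (ideal MLP 1.0) and the independent loads have 1 (ideal MLP 8.0); a real limit study would also have to model finite miss buffers and instruction window size.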